Measuring support for a hypothesis about a random 
parameter without estimating its unknown prior 



April 5, 2011

Running headline: Support for a random hypothesis

David R. Bickel

Ottawa Institute of Systems Biology
Department of Biochemistry, Microbiology, and Immunology
University of Ottawa



Abstract 

For frequentist settings in which parameter randomness represents variability rather
than uncertainty, the ideal measure of the support for one hypothesis over another is 
the difference in the posterior and prior log odds. For situations in which the prior 
distribution cannot be accurately estimated, that ideal support may be replaced by 
another measure of support, which may be any predictor of the ideal support that, 
on a per-observation basis, is asymptotically unbiased. Two qualifying measures of 
support are defined. The first is minimax optimal with respect to the population and is 
equivalent to a particular Bayes factor. The second is worst-sample minimax optimal
and is equivalent to the normalized maximum likelihood. It has been extended by 
likelihood weights for compatibility with more general models. 

One such model is that of two independent normal samples, the standard setting 
for gene expression microarray data analysis. Applying that model to proteomics data 
indicates that support computed from data for a single protein can closely approximate 
the estimated difference in posterior and prior odds that would be available with the 
data for 20 proteins. This suggests the applicability of random-parameter models to 
other situations in which the parameter distribution cannot be reliably estimated. 

Keywords: empirical Bayes; indirect evidence; information for discrimination; minimum 
description length; model selection; multiple comparisons; multiple testing; normalized max- 
imum likelihood; strength of statistical evidence; weighted likelihood 






1 Introduction 



The p-value has now served science for a century as a measure of the incompatibility between
a simple (point) null hypothesis and an observed sample of data. The celebrated advantage 
of the p-value is its objectivity relative to Bayesian methods in the sense that it is based 
on a model of frequencies of events in the world rather than on a model that describes the 
beliefs or decisions of an ideal agent. 

On the other hand, the Bayes factor has the salient advantage that it is easily interpreted 
in terms of combining with previous information. Unlike the p-value, it is a measure of 
support for one hypothesis over another; that is, it quantifies the degree to which the data 
change the odds that the hypothesis is true, whether or not a prior odds is available in 
the form of known frequencies. Although the Bayes factor does not depend on a prior 
probability of hypothesis truth, it does depend on which priors are assigned to the parameter 
distribution under the alternative hypothesis unless that alternative hypothesis is simple, in 
which case the Bayes factor reduces to the likelihood ratio if the null hypothesis is also simple. 
Unfortunately, the improper prior distributions generated by conventional algorithms cannot 
be directly applied to the Bayes factor. That has been overcome to some extent by dividing 
the data into training and test samples, with the training samples generating proper priors for 
use with test samples, but at the expense of requiring the specification of training samples 



and, when using multiple training samples, a method of averaging (Berger and Pericchi, 1996).



On the basis of concepts defined in Section 2, Section 3 will marshal results of information
theory to seize the above advantages of the p-value and Bayes factor by deriving measures 
of hypothesis support of wide applicability that are objective enough for routine scientific 
reporting. While such results have historically been cast in terms of minimum description 
length (MDL), an idealized minimax length of a message encoding the data, they will be 
presented herein without reliance on that analogy. For the present paper, it is sufficient to 
observe that the proposed level of support for one hypothesis over another is the difference 



in their MDLs and that Rissanen (1987) used a difference in previous MDLs to compare 
hypotheses. 

To define support in terms of the difference between posterior and prior log-odds without relying on non-frequency probability, Section 2.2 will relate the prior probability of hypothesis truth to the fraction of null hypotheses that are true. This framework is the two-groups model introduced for the analysis of gene expression data by empirical Bayes methods (Efron et al., 2001) and later adapted to other data of high-dimensional biology such as those of genome-wide association studies (Efron, 2010b; Yang and Bickel, 2010; and references) and to data of medium-dimensional biology such as those of proteins and metabolites (Bickel, 2010a,b).
In such applications, each gene or other biological feature corresponds to a different random 
parameter, the value of which determines whether its null hypothesis is true. 

While the proposed measures of hypothesis support fall under the two-groups umbrella, 
they are not empirical Bayes methods since they operate without any estimation or knowl- 
edge of prior distributions. Nonetheless, the unknown prior is retained in the model as a 
distribution across random parameters, including but not necessarily limited to those that 
generate the observed data. 

Thus, the methodology of this paper is applicable to situations in which reliable estimation of the unknown two-groups prior is not possible. Such situations often arise in practice. For example, the number of random parameters for which measurements are available and that have sufficient independence between parameters is often considered too small for reliable estimation of the prior distribution. Qiu et al. (2005) argued that, due to correlations in expression levels between genes, this is the case with microarray data. Less controversially, few would maintain that the prior can be reliably estimated when only one random parameter generated data, e.g., when the expression of only a single gene has been recorded. Another example is the setting in which the data cannot be reduced to continuous test statistics that adequately meet the assumptions of available empirical Bayes methods of estimating the prior distribution.

Section 2 fixes basic notation and explains the two-groups model. Under that framework, Section 3 defines support for one hypothesis over another in terms of a difference between the posterior and prior log-odds. Thus, reporting support in a scientific paper enables each reader to roughly determine what the posterior probability of either hypothesis would be using a different hypothetical value of its unknown prior probability. Section 4 then gives two qualifying measures of support, each of which is minimax optimal in a different sense. In Section 5, one of the optimal measures is compared to empirical Bayes methodology using real proteomics data. That case study addresses the extent to which optimal support on the basis of abundance measurements of a single protein can approximate the analogous value that would be available in the presence of measurements across multiple proteins. Finally, Section 6 closes with a concluding summary.



2 Preliminaries 

2.1 Distributions given the parameter values 

For all $i \in \{1, \dots, N\}$, the observed data vector $x_i$ of $n$ observations is assumed to be the outcome of $X_i$, the random variable of density function $f(\bullet|\phi_i)$ on sample space $\mathcal{X}^n$ for some $\phi_i$ in parameter space $\Phi$. Hypotheses about $\phi_i$, called the full parameter, are stated in terms of the subparameter $\theta_i = \theta(\phi_i)$, called the parameter of interest, which lies in a set $\Theta$. Consider the member $\theta_0$ of $\Theta$ in order to define the null hypotheses $\theta_1 = \theta_0, \dots, \theta_i = \theta_0, \dots, \theta_N = \theta_0$. The conditional density notation reflects the randomness of the parameter to be specified in Section 2.2.

A measurable map $\tau : \mathcal{X}^n \to \mathcal{T}$ yields $t_i = \tau(x_i)$ as the observed value of the random test statistic $T_i = \tau(X_i)$. The application of the map can often reduce the data to a lower-dimensional statistic, but the identity map may be employed if no reduction is desired: $T_i = X_i = \tau(X_i)$. In some cases, the map may be chosen to eliminate the nuisance parameter, which means the probability density function of $T_i$, conditional on $\theta_i$, may be written as $g(\bullet|\theta_i)$. Otherwise, the interest parameter is identified with the full parameter ($\theta_i = \theta(\phi_i) = \phi_i$), in which case $g(\bullet|\theta_i) = f(\bullet|\phi_i)$. Thus, the following methodology applies even when the nuisance parameter cannot be eliminated by data reduction.

2.2 Hierarchical model 

Let $P_1$ denote the alternative-hypothesis prior distribution, assumed to have measure-theoretic support $\Theta$, and let $\pi_0$ denote the probability that a given null hypothesis is true. (Unless prefaced by measure-theoretic, the term support in this paper means strength of statistical evidence rather than what it means in measure theory.) Like most hierarchical models, including those of empirical-Bayes and random-effects methods, this two-groups model uses random parameters to represent real variability rather than subjective uncertainty:

$$T_i \sim \pi_0 g_0 + \pi_1 g_1, \tag{1}$$

where $\pi_1 = 1 - \pi_0$, and where $g_0 = g(\bullet|\theta_0)$ and $g_1 = \int g(\bullet|\theta)\, dP_1(\theta)$ are the null and alternative density functions, respectively.

Let $P$ denote a joint probability distribution of $\theta$ and $T_i$ such that $P_1 = P(\bullet|\theta \ne \theta_0)$, $P(\theta = \theta_0) = \pi_0$, and $P(\bullet|\theta = \theta_i)$ admits $g(\bullet|\theta_i)$ as the density function of $T_i$ conditional on $\theta = \theta_i$ for all $\theta_i \in \Theta$. Let $A_i$ denote the random variable indicating whether, for all $i = 1, \dots, N$, the $i$th null hypothesis is true ($A_i = 0$) or whether the alternative hypothesis is true ($A_i = 1$). For sufficiently large $N$ and sufficient independence between random parameters, $\pi_0$ approximates, with high probability, the proportion of the null hypotheses that are true.

Bayes's theorem then gives

$$\frac{P(A_i = 1|T_i = t_i)}{P(A_i = 0|T_i = t_i)} = \frac{P(A_i = 1)\, g_1(t_i)}{P(A_i = 0)\, g_0(t_i)} = \frac{\pi_1\, g_1(t_i)}{\pi_0\, g_0(t_i)}, \tag{2}$$

but that cannot be used directly without knowledge of $\pi_0$ and of $g_1$, which is unknown since $P_1$ is unknown. Since the empirical Bayes strategy of estimating those priors is not always feasible (Section 1), the next section presents an alternative approach for inference about whether a particular null hypothesis is true.
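As a purely illustrative instance of equation (2), with hypothetical values that are not taken from the data of this paper, suppose that $\pi_0 = 0.9$ and that the density ratio at the observed statistic happened to be $g_1(t_i)/g_0(t_i) = 18$. The posterior odds and the posterior probability of the alternative hypothesis would then be:

```latex
\frac{P(A_i = 1 \mid T_i = t_i)}{P(A_i = 0 \mid T_i = t_i)}
  = \frac{\pi_1}{\pi_0} \cdot \frac{g_1(t_i)}{g_0(t_i)}
  = \frac{0.1}{0.9} \times 18 = 2,
\qquad
P(A_i = 1 \mid T_i = t_i) = \frac{2}{1 + 2} = \frac{2}{3}.
```

Both inputs to this calculation are exactly the quantities that are unavailable when the two-groups prior cannot be estimated.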



3 General definition of support 

One distribution will be said to surrogate the other if it can represent or take the place of the other for inferential purposes. Before precisely defining surrogation, the reason for introducing the concept will be explained. Given $g_1^*$, a probability density function that surrogates $g_1$, let $P^*$ denote the probability distribution that satisfies both $P^*(A_i = a) = P(A_i = a)$ for $a \in \{0, 1\}$ and

$$\frac{P^*(A_i = 1|T_i^* = t_i)}{P^*(A_i = 0|T_i^* = t_i)} = \frac{P^*(A_i = 1)\, g_1^*(t_i)}{P^*(A_i = 0)\, g_0(t_i)}, \tag{3}$$

where $T_i^*$ has the mixture probability density function $\pi_1 g_1^* + \pi_0 g_0$ rather than that of equation (1). Equation (3) and $P^*(A_i = 1) = P(A_i = 1)$ entail that $P^*(A_i = 1|T_i^* = t_i)$ surrogates $P(A_i = 1|T_i = t_i)$ inasmuch as $g_1^*$ surrogates $g_1$, which is unknown since it depends on $P_1$. Thus, posterior probabilities of hypothesis truth can be surrogated by using $g_1^*$ in place of $g_1$. Although the surrogate posterior probability depends on the proportion $P^*(A_i = 1) = \pi_1$, the measure of support to be derived from equation (3) does not require that $\pi_1$ be known or even that it be estimated.

The concept of surrogation will be patterned after that of universality. Let $E_{\theta_i}$ stand for the expectation operator defined by $E_{\theta_i}(\bullet) = \int \bullet\, dP(\bullet|\theta = \theta_i) = \int \bullet\, g(t|\theta_i)\, dt$. A probability density function $g_1^*$ is universal for the family $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$ if, for any $\theta_i \in \Theta$, the Kullback-Leibler divergence $D(g(\bullet|\theta_i)\|g_1^*) = E_{\theta_i}(\log[g(T_i|\theta_i)/g_1^*(T_i)])$ satisfies

$$\lim_{n \to \infty} D(g(\bullet|\theta_i)\|g_1^*)/n = 0. \tag{4}$$

The terminology comes from the theory of universal source coding (Grünwald, 2007, p. 200); $g_1^*$ is called "universal" because it is a single density function typifying all of the distributions of the parametric family. Equation (4) may be interpreted as the requirement that the per-observation bias in $\log g_1^*(T_i)$ as a predictor of $\log g(T_i|\theta_i)$ asymptotically vanishes. This lemma illustrates the concept of universality with an important example:

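Lemma 1. Let $\Pi$ denote a probability distribution that has measure-theoretic support $\Theta$. The mixture density $\bar{g}$ defined by $\bar{g}(t) = \int g(t|\theta)\, d\Pi(\theta)$ for all $t \in \mathcal{T}$ is universal for $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$.

Proof. By the stated assumption about $\Pi$, there is a $\tilde{\Theta} \subset \Theta$ such that $\theta_i \in \tilde{\Theta}$ and

$$\int g(t|\theta)\, d\Pi(\theta) \ge \sup_{\theta \in \tilde{\Theta}} g(t|\theta) \int_{\tilde{\Theta}} d\Pi(\theta) \tag{5}$$

for all $\theta_i \in \Theta$ and $t \in \mathcal{T}$. With $\sup_{\theta \in \tilde{\Theta}} g(t|\theta) \ge g(t|\theta_i)$ and $\bar{g}(t) = \int g(t|\theta)\, d\Pi(\theta)$, inequality (5) entails that

$$\lim_{n \to \infty} \frac{\log \bar{g}(t)}{n} \ge \lim_{n \to \infty} \frac{\log g(t|\theta_i) + \log \int_{\tilde{\Theta}} d\Pi(\theta)}{n} = \lim_{n \to \infty} \frac{\log g(t|\theta_i)}{n}$$

for all $\theta_i \in \Theta$ and $t \in \mathcal{T}$. While that yields $\lim_{n \to \infty} D(g(\bullet|\theta_i)\|\bar{g})/n \le 0$, the information inequality has $D(g(\bullet|\theta_i)\|\bar{g}) \ge 0$. The universality of $\bar{g}$ then follows from equation (4). (This proof generalizes a simpler argument using probability mass functions (Grünwald, 2007, p. 176).) □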
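The universality asserted by Lemma 1 can be checked numerically in a simple special case. The sketch below assumes, purely for illustration, $n$ IID $N(\theta, 1)$ observations and a $N(0, \tau^2)$ mixing distribution; neither choice comes from this paper, and the function name `kl_per_observation` is hypothetical. Because the sample mean is sufficient, the divergence between the sampling density and the mixture density reduces to a divergence between two univariate normal distributions, which has a closed form.

```python
import math

def kl_per_observation(theta, tau, n):
    """Per-observation KL divergence D(g(.|theta) || g_bar) / n.

    Illustrative assumptions (not from the paper): n IID N(theta, 1)
    observations and a N(0, tau^2) mixing distribution. The sample mean
    is sufficient, so the divergence equals
    KL( N(theta, 1/n) || N(0, 1/n + tau^2) ).
    """
    s2 = 1.0 / n              # variance of the sample mean given theta
    m2 = s2 + tau ** 2        # variance of the sample mean under the mixture
    kl = 0.5 * math.log(m2 / s2) + (s2 + theta ** 2) / (2.0 * m2) - 0.5
    return kl / n

if __name__ == "__main__":
    for n in (1, 10, 100, 10_000, 1_000_000):
        print(n, kl_per_observation(theta=1.0, tau=1.0, n=n))
```

In this example the divergence itself grows roughly like $\frac{1}{2} \log n$, so the per-observation value shrinks toward 0, which is the behavior that the definition of universality requires.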

Universality suggests a technical definition for surrogation. With respect to the family $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$, a probability density function $g'$ surrogates any probability density function $g''$ for which

$$\lim_{n \to \infty} E_{\theta_i}(\log[g'(T_i)/g''(T_i)])/n = 0 \tag{6}$$

for all $\theta_i \in \Theta$. The idea is that one distribution can represent or take the place of another for inferential purposes if their mean per-observation difference vanishes asymptotically. The following lemma then says that any universal distribution can stand in the place of any other distribution that is universal for the same family. It is a direct consequence of equations (4) and (6).

Lemma 2. If the probability density functions $g'$ and $g''$ are universal for $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$, then $g'$ surrogates $g''$ with respect to $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$.

The inferential use of one density function in place of another calls for a concept of surrogation error. The surrogation error of each probability distribution $P^*$ based on the probability density function $g_1^*$ in place of $g_1$ is defined by

$$\varepsilon^*(t) = \log\frac{P^*(A_i = 1|T_i^* = t)}{P^*(A_i = 0|T_i^* = t)} - \log\frac{P(A_i = 1|T_i = t)}{P(A_i = 0|T_i = t)}.$$

Then $P^*$ is said to surrogate $P$ if

$$\lim_{n \to \infty} E_{\theta_i}\varepsilon^*(T_i)/n = 0 \tag{7}$$

for all $i = 1, \dots, N$ and $a \in \{0, 1\}$. Equation (7) states the criterion that the per-observation bias in $\log[P^*(A_i = 1|T_i^* = T_i)/P^*(A_i = 0|T_i^* = T_i)]$ as a predictor of the true posterior log odds asymptotically vanishes. This bias is conservative:

Proposition 3. If $P^*$ is based on a density function $g_1^*$ on $\mathcal{T}$, then $E_{\theta_i}\varepsilon^*(T_i) \le 0$ for all $\theta_i \in \Theta$.

Proof. The following holds for all $i = 1, \dots, N$. By equations (2) and (3) with $P^*(A_i = a) = P(A_i = a)$ for $a \in \{0, 1\}$,

$$E_{\theta_i}\varepsilon^*(T_i) = -D(g_1\|g_1^*),$$

but $D(g_1\|g_1^*) \ge 0$ by the information inequality. □

The next result connects the concepts of surrogation (asymptotic per-observation unbiasedness) and universality.

Theorem 4. If $P^*$ is based on a density function $g_1^*$ that is universal for $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$, then it surrogates $P$.

Proof. Since $P_1$ has measure-theoretic support $\Theta$, Lemma 1 implies that $g_1$ is universal for $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$. The universality of $g_1$ and $g_1^*$ for $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$ then entails that $g_1^*$ surrogates $g_1$ by Lemma 2. According to equation (6), such surrogation means

$$\lim_{n \to \infty} E_{\theta_i}(\log[g_1^*(T_i)/g_1(T_i)])/n = 0. \tag{8}$$

By equations (2) and (3) with $P^*(A_i = a) = P(A_i = a)$ for $a \in \{0, 1\}$,

$$\lim_{n \to \infty} E_{\theta_i}\varepsilon^*(T_i)/n = \lim_{n \to \infty} E_{\theta_i}(\log[g_1^*(T_i)/g_1(T_i)])/n,$$

which equation (8) says is equal to 0. □



The difference in conditional and marginal log-odds,

$$S(t_i; P^*) = \log\frac{P^*(A_i = 1|T_i^* = t_i)}{P^*(A_i = 0|T_i^* = t_i)} - \log\frac{P^*(A_i = 1)}{P^*(A_i = 0)}, \tag{9}$$

is called the support that the observation $t_i$ transmits to the hypothesis that $\theta_i \ne \theta_0$ over the hypothesis that $\theta_i = \theta_0$ according to $P^*$, which by assumption surrogates $P$. While the concise terminology follows Edwards (1992), the basis on a change in log-odds is that of the information for discrimination (Kullback, 1968). Royall (2000a), Blume (2002), and others have used the term strength of statistical evidence as a synonym for support in the original sense of Edwards (1992).



Proposition 5. If $P^*$ surrogates $P$ based on the universal density function $g_1^*$, then the support that the observation $t_i$ transmits to the hypothesis that $\theta_i \ne \theta_0$ over the hypothesis that $\theta_i = \theta_0$ according to $P^*$ is

$$S(t_i; P^*) = \log\frac{g_1^*(t_i)}{g_0(t_i)}. \tag{10}$$

Proof. Substituting the solution of equation (3) for $g_1^*(t_i)/g_0(t_i)$ into equation (10) recovers equation (9). □

Since the support according to $P^*$ only depends on $P^*$ through its universal density, $S(t_i; g_1^*) = \log(g_1^*(t_i)/g_0(t_i))$ is more simply called the support that the observation $t_i$ transmits to the hypothesis that $\theta_i \ne \theta_0$ over the hypothesis that $\theta_i = \theta_0$ according to $g_1^*$. Hence, the same value of the support applies to different hypothetical values of $\pi_0$ and even across different density functions as $g_1$, the unknown alternative distribution of the reduced data.
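Because support is the difference between posterior and prior log odds, a reader can combine a reported support value with any hypothetical prior probability. The following sketch shows that conversion; the function name `posterior_prob_alternative` and the numerical inputs are illustrative assumptions, and natural logarithms are assumed.

```python
import math

def posterior_prob_alternative(support, pi0):
    """Surrogate posterior probability of the alternative hypothesis.

    support: log[g1*(t) / g0(t)] in natural-log units (assumed).
    pi0:     hypothetical prior probability that the null is true, 0 < pi0 < 1.
    """
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie strictly between 0 and 1")
    prior_log_odds = math.log((1.0 - pi0) / pi0)
    posterior_log_odds = prior_log_odds + support
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))

if __name__ == "__main__":
    s = math.log(9.0)  # hypothetical reported support
    for pi0 in (0.5, 0.9, 0.99):
        print(pi0, posterior_prob_alternative(s, pi0))
```

A support of 0 leaves the prior probability unchanged, whatever value of the prior the reader supplies.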



4 Optimal measures of support

Equations (2) and (3) with $P^*(A_i = a) = P(A_i = a)$ for $a \in \{0, 1\}$ imply that the surrogation error of $P^*$ is equal to the surrogation error of $g_1^*(t)$,

$$\varepsilon^*(t) = \log g_1^*(t) - \log g_1(t), \tag{11}$$

which depends neither on $\pi_0$ nor on any other aspect of $P^*$ apart from $g_1^*$. Thus, the problem of minimizing the surrogation error of $P^*$ reduces to that of optimizing the universal density $g_1^*$ on which $P^*$ is based. Such optimality may be either with respect to the population represented by $g_1$ or with respect to the observed sample reduced to $t_i$. The remainder of this section formalizes each type of optimality as a minimax problem with a worst-case member of $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$ in place of the unknown mixture density $g_1 = \int g(\bullet|\theta)\, dP_1(\theta)$.

4.1 Population optimality

Among all probability density functions on $\mathcal{T}$, let $g_1^*$ be that which minimizes the maximum average log loss

$$\sup_{\theta_i \in \Theta} E_{\theta_i}\left(\log\frac{g(T_i|\theta_i)}{g_1^*(T_i)}\right).$$

Since the loss at each $\theta_i$ is averaged over the population represented by the sampling density $g(\bullet|\theta_i)$, the solution $g_1^*$ will be called the population-optimal density function relative to $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$. That density function is the mixture density

$$g_1^*(t) = \int g(t|\theta_i)\, p_1^*(\theta_i)\, d\theta_i$$

for all $t \in \mathcal{T}$, where $p_1^*$ is the probability density function on $\Theta$ that maximizes

$$\int p(\theta_i)\, D(g(\bullet|\theta_i)\|g_p)\, d\theta_i, \qquad g_p(t) = \int g(t|\theta_i)\, p(\theta_i)\, d\theta_i,$$

over probability density functions $p$ on $\Theta$ (Rissanen, 2007, §5.2.1).



The prior density function $p_1^*$ thereby defined is difficult to compute at finite samples but asymptotically approaches the Jeffreys prior (Rissanen, 2009, §2.3.2), which was originally derived for Bayesian inference from an invariance argument (Jeffreys, 1948). Whereas $P_1$ is an unknown distribution of parameter values that describe physical reality, $p_1^*$ is a default prior that serves as a tool for inference for scenarios in which suitable estimates of $P_1$ are not available. Lemma 1 secures the universality of $g_1^*$, which in turn implies that $\log[g_1^*(t_i)/g_0(t_i)]$ qualifies as support by Proposition 5.

For the observation $t_i$, $g_1^*(t_i)$ may likewise be considered as a default integrated likelihood and the support (10) as the logarithm of a default Bayes factor. Drmota and Szpankowski (2004) reviewed asymptotic properties of the population-optimal density function and related it to the universal density function satisfying the optimality criterion of the next subsection.



4.2 Sample optimality 

Among all probability density functions on $\mathcal{T}$, let $g_1^*$ be the one that minimizes the maximum worst-case log loss

$$\sup_{\theta_i \in \Theta,\, t \in \mathcal{T}} \log\frac{g(t|\theta_i)}{g_1^*(t)}. \tag{12}$$

Since the regret $\sup_{\theta_i \in \Theta} \log[g(t_i|\theta_i)/g_1^*(t_i)]$ incurred by any observed sample $t_i$ is no greater than that of the worst-case sample, $g_1^*$ will be referred to as the sample-optimal density function relative to $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$. As proved by Shtarkov (1987), the unique solution to that minimax problem is

$$g_1^*(t_i) = \frac{g(t_i; \hat{\theta}_i(t_i))}{\int_{\mathcal{T}} g(t; \hat{\theta}_i(t))\, dt}, \tag{13}$$

with the normalizing constant $Z = \int_{\mathcal{T}} g(t; \hat{\theta}_i(t))\, dt$ automatically acting as a penalty for model complexity, where the maximum likelihood estimate (MLE) for any $t \in \mathcal{T}$ is denoted by $\hat{\theta}_i(t) = \arg\sup_{\theta_i \in \Theta} g(t|\theta_i)$ (Rissanen, 2007; Grünwald, 2007). The probability density $g_1^*(t_i)$ is thus known as the normalized maximum likelihood (NML). Its universality (4) follows from the convergence of

$$\frac{D(g(\bullet|\theta_i)\|g_1^*)}{n} = \frac{E_{\theta_i}\left(\log\left[g(T_i|\theta_i)/g(T_i; \hat{\theta}_i(T_i))\right]\right)}{n} + \frac{\log Z}{n}$$

to 0, which holds under the consistency of $\hat{\theta}_i(T_i)$ since the growth of $\log Z$ is asymptotically proportional to $\log n$ (Rissanen, 2007; Grünwald, 2007). Thus, Proposition 5 guarantees that $\log[g_1^*(t_i)/g_0(t_i)]$ measures support.
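For a discrete model the integral in the NML normalizing constant becomes a finite sum, which makes the NML easy to compute exactly. The sketch below uses a binomial model with null value $\theta_0 = 1/2$ as an assumed toy example; it is not the model analyzed in this paper, and all function names are hypothetical.

```python
import math

def max_likelihood(x, n):
    """Binomial likelihood of x successes in n trials at the MLE x / n."""
    p_hat = x / n
    # Python evaluates 0.0 ** 0 as 1.0, the convention needed at x = 0 and x = n.
    return math.comb(n, x) * p_hat ** x * (1.0 - p_hat) ** (n - x)

def normalizer(n):
    """Z: the maximized likelihood summed over all possible outcomes."""
    return sum(max_likelihood(x, n) for x in range(n + 1))

def null_probability(x, n, theta0=0.5):
    """Binomial probability of x under the simple null theta = theta0."""
    return math.comb(n, x) * theta0 ** x * (1.0 - theta0) ** (n - x)

def nml_support(x, n, theta0=0.5):
    """log[NML(x) / g0(x)] for the toy binomial model."""
    nml = max_likelihood(x, n) / normalizer(n)
    return math.log(nml / null_probability(x, n, theta0))

def upper_bound(x, n, theta0=0.5):
    """Log of the maximum likelihood ratio, with no complexity penalty."""
    return math.log(max_likelihood(x, n) / null_probability(x, n, theta0))

if __name__ == "__main__":
    n = 10
    for x in (5, 8, 10):
        print(x, nml_support(x, n), upper_bound(x, n))
```

In this toy model the maximum likelihood ratio exceeds the NML ratio by the factor $Z$ for every possible observation, so the two log ratios differ by the constant $\log Z$.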

For inference about $\theta_i$, the incidental statistics $t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_N$ provide side information or "indirect evidence" (Efron, 2010a) in addition to the "direct evidence" provided by the focus statistic $t_i$. The problem of incorporating side information into inference has been addressed with the weighted likelihood function $L_i(\bullet; t_i)$ (Hu and Zidek, 2002; Wang and Zidek, 2005) defined by

$$\log L_i(\theta_i; t_i) = \sum_{j=1}^{N} w_{ij} \log g(t_j|\theta_i) \tag{14}$$

for all $\theta_i \in \Theta$, where the focus weight $w_{ii}$ is no less than any of the incidental weights $w_{ij}$ ($j \ne i$). For notational economy and parallelism with $g(t_i|\theta_i)$, the left-hand side expresses dependence on the focus statistic but not on the incidental statistics.



Replacing the likelihood function in equation (12) with the weighted likelihood function, while taking the worst-case sample of the focus statistic and holding the incidental statistics fixed, yields a minimax problem with the unique solution

$$g_{1i}^*(t_i) = \frac{L_i(\hat{\theta}_i(t_i); t_i)}{\int_{\mathcal{T}} L_i(\hat{\theta}_i(t); t)\, dt}, \tag{15}$$

where the maximum weighted likelihood estimate (MWLE) for any $t \in \mathcal{T}$ is denoted by $\hat{\theta}_i(t) = \arg\sup_{\theta \in \Theta} L_i(\theta; t)$ (Bickel, 2010b). Accordingly, $g_{1i}^*$ will be called the sample-optimal density function relative to $\{g(\bullet|\theta_i) : \theta_i \in \Theta\}$ and $w_{i1}, \dots, w_{iN}$. If $w_{ij} = (n + 1)^{-1}(N - 1)^{-1}$ for all $j \ne i$ and $w_{ii} = 1 - (n + 1)^{-1}$, then $w_{i1}, \dots, w_{iN}$ are single-observation weights in the sense that $\sum_{j \ne i} w_{ij} = w_{ii}/n$ (Bickel, 2010b). In accordance with equation (10), the corresponding sample-optimal support is $S^*(t_i) = \log[g_{1i}^*(t_i)/g_0(t_i)]$. When data are only available for one of the populations, the NMWL using single-observation weights may be closely approximated by considering

$$\log L_i(\theta_i; t_i) = (n + 1)^{-1} \log g(t_0|\theta_i) + \left(1 - (n + 1)^{-1}\right) \log g(t_i|\theta_i) \tag{16}$$

as the logarithm of the weighted likelihood, where $t_0$ is a pseudo-observation such as the mode of $T_i$ under the null hypothesis (Bickel, 2010b).
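The single-observation weights and the weighted log-likelihood of equation (14) are simple to express in code. In the sketch below, the function names, the zero-based indexing, and the unit-variance normal log density used in the usage lines are illustrative assumptions rather than part of the cited method.

```python
import math

def single_observation_weights(n, N, i):
    """Weights w_i1..w_iN for focus index i (zero-based), assuming N >= 2."""
    focus = 1.0 - 1.0 / (n + 1)
    incidental = 1.0 / ((n + 1) * (N - 1))
    return [focus if j == i else incidental for j in range(N)]

def weighted_log_likelihood(theta, stats, weights, log_density):
    """Sum over j of w_ij * log g(t_j | theta), as in equation (14)."""
    return sum(w * log_density(t, theta) for w, t in zip(weights, stats))

def normal_log_density(t, theta):
    """Log density of N(theta, 1); a stand-in for log g(t | theta)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (t - theta) ** 2

if __name__ == "__main__":
    w = single_observation_weights(n=4, N=6, i=0)
    stats = [2.1, 0.3, -0.4, 1.0, 0.2, -0.9]  # made-up statistics
    print(w, weighted_log_likelihood(0.5, stats, w, normal_log_density))
```

With these weights the incidental statistics jointly carry the weight of one observation out of the $n + 1$ effectively available, which is why the weights sum to 1.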



The probability density $g_{1i}^*(t_i)$ is called the normalized maximum weighted likelihood (NMWL). It applies to more general contexts than the NML: there are many commonly used distribution families for which $\int_{\mathcal{T}} L_i(\hat{\theta}_i(t); t)\, dt$ but not $\int_{\mathcal{T}} g(t; \hat{\theta}_i(t))\, dt$ is finite (Bickel, 2010b). As with other extensions of the NML to such families (Grünwald, 2007, Chapter 11), conditions under which the NMWL is universal have yet to be established. Thus, Proposition 5 cannot be invoked at this time, and one may only conjecture that $S^*(t_i)$ satisfies the general criterion of a measure of support (Section 3) in a particular context. The conjecture is suggested for the normal family by the finding of the next section that $g_{1i}^*(t_i)$ can closely approximate a universal density even for very small samples.



5 Proximity to simultaneous inference: a case study

This section describes a case study on the extent to which support computed on the ba- 
sis of measurements of the abundance of a single protein can approximate the true differ- 
ence between posterior and prior log odds. Since that true difference is unknown, it will 
be estimated using an empirical Bayes method to simultaneously incorporate the available 
abundance measurements for all proteins. 

Specifically, the individual sample-optimal support of each protein was compared to an estimated Bayes factor using levels of protein abundance in plasma as measured in the laboratory of Alex Miron at the Dana-Farber Cancer Institute. The participating women include 55 with HER2-positive breast cancer, 35 mostly with ER/PR-positive breast cancer, and 64 without breast cancer. The abundance levels, available in Li (2009), were transformed by shifting them to ensure positivity and by taking the logarithms of the shifted abundance levels (Bickel, 2010a).



The transformed abundance levels of protein $i$ were assumed to be IID normal within each health condition and with an unknown variance $\sigma_i^2$ common to all three conditions. For one of the cancer conditions and for the non-cancer condition, $\mu_i^{\text{cancer}}$ and $\mu_i^{\text{healthy}}$ will denote the means of the respective normal distributions, and $n^{\text{cancer}} \in \{55, 35\}$ and $n^{\text{healthy}} = 64$ will likewise denote the numbers of women with each condition. Let $T_i$ represent the absolute value of the Student $t$ statistic appropriate for testing the null hypothesis of $\theta_i = 0$, where $\theta_i = |\delta_i|$ and

$$\delta_i = \frac{\mu_i^{\text{cancer}} - \mu_i^{\text{healthy}}}{\sigma_i \sqrt{(n^{\text{cancer}})^{-1} + (n^{\text{healthy}})^{-1}}},$$

the standardized cancer-healthy difference in the population mean transformed abundance in the $i$th protein. Under the stated assumptions, the Student $t$ statistic, conditional on $\delta_i$, has a noncentral $t$ distribution with $n^{\text{cancer}} + n^{\text{healthy}} - 2$ degrees of freedom and noncentrality parameter $\delta_i$ (Bickel, 2010a). Thus, because $T_i$ is the absolute value of that statistic, $\theta_i$ is the only unknown parameter of $g(\bullet|\theta_i)$, the probability density function of $T_i|\theta_i$.

[Figure 1 panels: 55 HER2 pos. vs. 64 healthy; 35 ER/PR pos. vs. 64 healthy]

Figure 1: Single-comparison, sample-optimal support ("minimax"; $g_{1i}^*(t_i)/g_0(t_i)$) as an approximation to the estimated support that could be achieved with multiple comparisons ("simultaneous"; $g(t_i|\hat{\theta}_{\text{alternative}})/g_0(t_i)$). The "upper bound" is $\max_{\theta \in \Theta} g(t_i|\theta)/g_0(t_i)$ (Bickel, 2010c), exceeding the optimal support by a constant amount.
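The test statistic of this section can be computed directly from the transformed abundance levels of one protein. The following sketch computes the absolute value of the pooled-variance two-sample Student t statistic; the function name and the toy numbers are hypothetical and are not the proteomics data.

```python
import math

def abs_pooled_t(cancer, healthy):
    """Absolute two-sample Student t statistic with a pooled variance estimate."""
    n1, n2 = len(cancer), len(healthy)
    mean1 = sum(cancer) / n1
    mean2 = sum(healthy) / n2
    ss1 = sum((x - mean1) ** 2 for x in cancer)
    ss2 = sum((x - mean2) ** 2 for x in healthy)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)   # n1 + n2 - 2 degrees of freedom
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return abs(mean1 - mean2) / std_err

if __name__ == "__main__":
    # Made-up log abundance levels for one protein.
    print(abs_pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Taking the absolute value discards the sign of the difference, so swapping the two groups leaves the statistic unchanged.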

With that model and test statistic, the NMWL and the corresponding sample-optimal support were computed separately for each protein using $t_0 = 0$ in equation (16), as in Bickel (2010b). For the analysis of the data of all proteins simultaneously, the same model and test statistics were used with the two-component mixture model defined by equation (1) with $g_1 = g(\bullet|\theta_{\text{alternative}})$ for some unknown $\theta_{\text{alternative}} \in \Theta$. The true alternative density function $g_1$ was estimated by plugging in the maximum likelihood estimate $\hat{\theta}_{\text{alternative}}$ obtained from maximizing the likelihood function

$$\prod_{i=1}^{N} \left(\pi_0\, g(t_i|0) + (1 - \pi_0)\, g(t_i|\theta_{\text{alternative}})\right)$$

over $\theta_{\text{alternative}}$ and $\pi_0$ (Bickel, 2010a). The results appear in Fig. 1 and are discussed in the next section.

6 Discussion



The proposed framework of evidential support may be viewed as an extension of likelihoodism, classically expressed in Edwards (1992), to nuisance parameters and multiple comparisons.



Edwards (1992, §3.2) argued that a measure of evidence in data or support for one 



simple hypothesis (sampling distribution) over another should be compatible with Bayes's 
theorem in the sense that whenever real-world parameter probabilities are available, the 
support quantifies the departure of posterior odds from prior odds. The likelihood ratio 
has that property, but the p-value does not since it only depends on the distribution of the 
null hypothesis. As compelling as the argument is for comparing two simple hypotheses, 
the pure likelihood approach does not apply to a composite hypothesis, a set of sampling 
distributions. 



Perceiving the essential role of composite hypotheses in many applications, Zhang (2009) previously extended likelihoodism by replacing the likelihood for the single distribution that represents a simple hypothesis with the likelihood maximized over all parameter values that constitute a composite hypothesis. Thus, the strength of evidence for the alternative hypothesis that $\phi$ is in some interval (or union of intervals) $\Phi_1$ over the null hypothesis that $\phi$ is in some other interval $\Phi_0$ would be $\max_{\phi \in \Phi_1} f(x_i|\phi) / \max_{\phi \in \Phi_0} f(x_i|\phi)$. For example, the strength of evidence favoring $\phi \ne \phi_0$ over $\phi = \phi_0$ would be $\max_{\phi \in \Phi} f(x_i|\phi) / f(x_i|\phi_0)$. The related approach of Bickel (2010c) performs the maximization after eliminating the nuisance parameter: $\max_{\theta \in \Theta} g(t_i|\theta) / g(t_i|\theta_0)$. While that approach to some extent justifies the use of likelihood intervals (Fisher, 1973) and has intuitive support from the principle of inference to the best explanation (Bickel, 2010c), it tends to overfit the data from a predictive viewpoint. For example, if $\theta_1 = \arg\max_{\theta \in \Theta_1} L(\theta)$, then the evidence for the hypothesis that $\theta \in \Theta_1$ would be just as strong as the evidence for the hypothesis that $\theta = \theta_1$ even if the latter hypothesis were in primary view before observing $x$. Thus, the maximum likelihood ratio is considered as an upper bound of support in Fig. 1.

The present paper also generalizes the pure likelihood approach but without such overfitting. The proposed approach grew out of the Bayes-compatibility criterion of Edwards (1992). By leveraging recent advances in J. Rissanen's information-theoretic approach to model selection, the Bayes-compatibility criterion was recast in terms of predictive distributions, thereby making support applicable to composite hypotheses. To qualify as a measure of support, a statistic must asymptotically mimic the difference between the posterior and prior log-odds, where the parameter distributions considered are physical in the empirical Bayes or random effects sense that they correspond to real frequencies or proportions (Robinson, 1991), whether or not the distributions can be estimated.



Generalized Bayes compatibility has advantages even when support is not used with a hypothetical prior probability. For example, defining support in terms of the difference between the posterior and prior log-odds (9) is sufficient for interpreting $S^*(t_i) > 5$ or some other level of support in the same way for any sample size (Royall, 2000b). In other words, no sample-size calibration is necessary (cf. Bickel, 2010b).

In addition to the Bayes-compatibility condition, an optimality criterion such as one of the two lifted from information theory is needed to uniquely specify a measure of support (Section 4). One of the resulting minimax-optimal measures of support performed well compared to the upper bound when applied to measured levels of a single protein (Section 5). The standard of comparison was the difference between posterior and prior log odds that could be estimated by simultaneously using the measurements of all 20 proteins. While both the minimax support and the upper bound come close to the simultaneous-inference standard, the conservative nature of the minimax support prevented it from overshooting the target as much as did the upper bound (Fig. 1). The discrepancy between the minimax support and the upper bound will become increasingly important as the dimension of the interest parameter increases. In high-dimensional applications, overfitting will render the upper bound unusable, but minimax support will be shielded by a correspondingly high penalty factor $\int_{\mathcal{T}} g(t; \hat{\theta}_i(t))\, dt$ in equation (13).